Abstract. This paper proposes an incremental method that can be used by an
intelligent system to learn better descriptions of a thematic context. The method
starts with a small number of terms selected from a simple description of the topic
under analysis and uses this description as the initial search context. Using these
terms, a set of queries is built and submitted to a search engine. New documents
and terms are used to refine the learned vocabulary. Evaluations performed on a
large number of topics indicate that the learned vocabulary is much more effective
than the original one for constructing queries that retrieve relevant material.



1 Introduction 

Today's search engine interfaces are appropriate when the user knows what to seek
and how to seek it. However, they are unable to reflect the user's context and are
therefore not smart enough to understand the user's real needs. For several years
researchers in the Artificial Intelligence community have discussed the importance of
intelligent systems that cooperate with the user to facilitate a number of computer-mediated
tasks [10, 12]. More recently, the problem of accessing relevant information through
intelligent systems has become a main research area. In order to implement intelligent 
Information Retrieval (IR) systems some researchers have proposed taking advantage 
of existing services to build more powerful tools on top of them 181711 . Examples of 
systems that apply this approach take advantage of major search engines to perform 
intelligent context-based search [2, 4, 17, 13, 16].

The Web can be regarded as a rich repository of collective memory. An intelligent 
system that incrementally searches this repository to find material that is useful to the 
user's current needs can act as a memory augmentation aid. By an association of simi- 
larities, this aid can help users remember information, assure that areas relevant to the 
current task have been considered, and pursue new directions. 

Descriptions of a user's needs, however, are usually deficient because they are typically
based on a priori knowledge of the topic of interest. This knowledge might
be insufficient to formulate a good query, or more commonly, the vocabulary used by
the user might not be appropriate to target the request at the right kind of material. In
certain scenarios, attaining novelty and diversity may be as important as, or even more
important than, attaining similarity. With human-generated queries, users frequently
decide, based on initial results, to refine subsequent queries. If contextual information is
available, part of the query formation and refinement process can be automated.

This paper proposes a new technique for incrementally learning a better character- 
ization of the user context. The work presented here suggests and tests the following 
hypotheses: (1) the vocabulary describing the initial context can be used to identify 
semantically related documents and terms, but (2) the terms describing the initial con- 
text are not necessarily the most appropriate ones to generate search queries, and (3) 
the characterization of the search context can be incrementally improved by a semi- 
supervised learning algorithm. 

Our algorithm is based on the dynamic extraction of topic descriptors and discriminators,
as first introduced in [14]. The main contribution of this paper is the proposal
of a new mechanism for learning rich vocabularies associated with a thematic context.
The learned vocabulary provides an improved characterization of the topic of interest
in the sense that it allows topically relevant material to be identified more accurately.
The effectiveness of our proposal is assessed by carrying out a comprehensive evaluation
on a large collection of human-generated topic descriptions.



2 Context Characterizations 

For many computer-mediated tasks, the user context provides a rich set of terms that can 
be exploited by intelligent systems to generate queries and present related information 
to the user. Such systems can be equipped with special monitoring capabilities, designed 
to generate a model of the user context. The system will be in charge of observing 
how the user interacts with different kinds of computer utilities (such as email systems, 
browsers and text editors) to characterize the user's information needs as a collection of 
weighted terms. This requires a framework for learning context-specific terms. 



2.1 The Different Role of Terms 

A central question addressed in our work is how to learn context-specific terms based on
the user's current context and an open collection of incrementally retrieved documents. In
what follows, we will assume that a user context is represented as a set of terms. Consider,
for example, a topic involving the Java Virtual Machine. Context-specific terms
may play different roles. For example, the term java is a good descriptor of the topic for
a general audience. However, java is not a good discriminator for that topic because it
might also refer to the island in Indonesia, the java shark, a brand of Russian cigarettes
or a variety of coffee grown on the island of Java, among other possibilities.¹

¹ The Wikipedia disambiguation page lists more than 50 senses for the word java.



If we reconsider the topic Java Virtual Machine we notice that terms such as jvm and 
jdk — which stand for "Java Virtual Machine" and "Java Development Kit" — may not 
be good descriptors of the topic for a general audience, but are effective in bringing in- 
formation that is relevant for our topic of interest when presented in a query. Therefore, 
jvm and jdk are good discriminators of that topic. 

A natural question that arises in this scenario is how to identify the terms that act as
good descriptors and good discriminators of a topic. In previous work [14, 11] we have
studied and tested the following two hypotheses:

- Good topic descriptors can be found by looking for terms that occur often in docu- 
ments related to the given topic. 

- Good topic discriminators can be found by looking for terms that occur only in 
documents related to the given topic. 

Both topic descriptors and discriminators are important as query terms. Because 
topic descriptors occur often in relevant pages, using them as query terms may improve 
recall. Similarly, good topic discriminators occur primarily in relevant pages, and there- 
fore using them as query terms may improve precision. 



2.2 Computing Topic Descriptors and Topic Discriminators 

As a first approximation to compute descriptive and discriminating power, we begin
with a collection of m documents and n terms. As a starting point we build an m × n
matrix H, such that $H[i, j] = k$, where $k$ is the number of occurrences of term $t_j$ in
document $d_i$. In particular, we can assume that one of the documents (e.g., $d_0$)
corresponds to the initial user context.

The matrix H allows us to formalize the notions of good descriptors and good
discriminators. We define the descriptive power of a term in a document as a function
$\lambda : \{d_0, \ldots, d_{m-1}\} \times \{t_0, \ldots, t_{n-1}\} \to [0, 1]$:

$$\lambda(d_i, t_j) = \frac{H[i, j]}{\sqrt{\sum_{k=0}^{n-1} (H[i, k])^2}}$$

Note that $\lambda$ can be regarded as a version of matrix H normalized by row (i.e., by
document).

If we adopt $s(k) = 1$ whenever $k > 0$ and $s(k) = 0$ otherwise, we can define
the discriminating power of a term in a document as a function
$\delta : \{t_0, \ldots, t_{n-1}\} \times \{d_0, \ldots, d_{m-1}\} \to [0, 1]$:

$$\delta(t_i, d_j) = \frac{s(H[j, i])}{\sqrt{\sum_{k=0}^{m-1} s(H[k, i])}}$$

In this case $\delta$ can be regarded as a transposed version of matrix H normalized by column
(i.e., by term).
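
The two definitions above can be illustrated with a minimal Python sketch. It assumes that the descriptive power is the row of H scaled to unit Euclidean length, and that the discriminating power of a term is one over the square root of the number of documents containing it. The function names and the toy matrix are hypothetical and serve only as an illustration.

```python
import math

def descriptive_power(H):
    """Return lam[i][j]: occurrences of term j in document i, with each
    row (document) scaled to unit Euclidean length."""
    lam = []
    for row in H:
        norm = math.sqrt(sum(k * k for k in row))
        lam.append([k / norm if norm > 0 else 0.0 for k in row])
    return lam

def discriminating_power(H):
    """Return delta[i][j] for term i and document j: 1/sqrt(number of
    documents containing term i) if document j contains it, else 0."""
    m, n = len(H), len(H[0])
    delta = []
    for i in range(n):
        doc_count = sum(1 for k in range(m) if H[k][i] > 0)
        delta.append([1.0 / math.sqrt(doc_count) if H[j][i] > 0 else 0.0
                      for j in range(m)])
    return delta

# Toy collection (illustrative only): rows are documents d0..d2,
# columns are the terms java, jvm, coffee.
H = [
    [2, 1, 0],
    [3, 2, 0],
    [1, 0, 4],
]
lam = descriptive_power(H)
delta = discriminating_power(H)
```

In this toy collection the term java occurs in every document, so its discriminating power is low in each of them, whereas coffee occurs only in the third document and discriminates it fully.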

Our current goal is to learn a better characterization of the user's needs. Therefore,
rather than extracting descriptors and discriminators directly from the user context, we
want to extract them from the topic of the user context. This requires an incremental
method to characterize the topic of the user context, which is done by identifying documents
that are similar to the user's current context. Assume the user context and the
retrieved documents are represented as document vectors in term space. To determine
how similar two documents $d_i$ and $d_j$ are, we adopt the IR cosine similarity [1]. This
measure is defined as a function
$\sigma : \{d_0, \ldots, d_{m-1}\} \times \{d_0, \ldots, d_{m-1}\} \to [0, 1]$:

$$\sigma(d_i, d_j) = \sum_{k=0}^{n-1} [\lambda(d_i, t_k) \cdot \lambda(d_j, t_k)]$$
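
As a small illustration of this measure, the following Python sketch computes the cosine similarity directly from raw term-frequency vectors, which is equivalent to taking the dot product of the two length-normalized vectors. The function name, the example vectors, and the choice of returning 0 for an empty document are assumptions made for this sketch.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two term-frequency vectors of equal length."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: an empty document is similar to nothing
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)

# Illustrative vectors over the terms (java, jvm, coffee).
context = [2, 1, 0]
related = [3, 2, 0]
unrelated = [0, 0, 4]
sim_related = cosine_similarity(context, related)
sim_unrelated = cosine_similarity(context, unrelated)
```

Documents that share no terms obtain a similarity of zero, while documents with proportional term frequencies obtain a similarity close to one.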

We formally define the term descriptive power in the topic of a document as a
function $\Lambda : \{d_0, \ldots, d_{m-1}\} \times \{t_0, \ldots, t_{n-1}\} \to [0, 1]$. We set
$\Lambda(d_i, t_j) = 0$ if $\sum_{k=0, k \neq i}^{m-1} \sigma(d_i, d_k) = 0$. Otherwise we define
$\Lambda(d_i, t_j)$ as follows:

$$\Lambda(d_i, t_j) = \frac{\sum_{k=0, k \neq i}^{m-1} [\sigma(d_i, d_k) \cdot \lambda(d_k, t_j)^2]}{\sum_{k=0, k \neq i}^{m-1} \sigma(d_i, d_k)}$$

Thus, the descriptive power of a term $t_j$ in the topic of a document $d_i$ is a measure of
the quality of $t_j$ as a descriptor of documents similar to $d_i$.
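
The definition can be sketched in Python as a similarity-weighted average over the other documents in the collection. The sketch assumes that each row of the input holds the per-document descriptive power of every term and has unit Euclidean length, so that the dot product of two rows gives their cosine similarity. The function name and the toy values are hypothetical.

```python
def topic_descriptive_power(lam, i):
    """Descriptive power of every term in the topic of document i.

    lam[k][j] is the descriptive power of term j in document k, with each
    row assumed to have unit Euclidean length, so the dot product of two
    rows is their cosine similarity.
    """
    m, n = len(lam), len(lam[0])
    sims = [sum(a * b for a, b in zip(lam[i], lam[k])) for k in range(m)]
    others = [k for k in range(m) if k != i]
    total = sum(sims[k] for k in others)
    if total == 0:
        return [0.0] * n  # no similar documents: the power is defined as 0
    return [sum(sims[k] * lam[k][j] ** 2 for k in others) / total
            for j in range(n)]

lam = [
    [0.6, 0.8],  # d0: the context, mixing both terms
    [1.0, 0.0],  # d1: contains only term 0
    [0.0, 1.0],  # d2: contains only term 1
]
topic_power = topic_descriptive_power(lam, 0)
```

Because the context is more similar to the third document than to the second, the term that dominates the third document receives the higher descriptive power in the topic of the context.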

Analogously, we define the discriminating power of a term in the topic of a document
as a function $\Delta : \{t_0, \ldots, t_{n-1}\} \times \{d_0, \ldots, d_{m-1}\} \to [0, 1]$ calculated as follows:

$$\Delta(t_i, d_j) = \sum_{k=0, k \neq j}^{m-1} [\delta(t_i, d_k)^2 \cdot \sigma(d_k, d_j)]$$

Thus the discriminating power of term $t_i$ in the topic of document $d_j$ is an average
of the similarity of $d_j$ to other documents discriminated by $t_i$. For a worked example
showing the results of computing topic descriptors and discriminators see [11].
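
A minimal Python sketch of this computation follows. It assumes that the per-document discriminating power of the term and the pairwise document similarities have already been computed; the function name, the similarity matrix, and the example term are hypothetical.

```python
import math

def topic_discriminating_power(delta_row, sigma, j):
    """Discriminating power of one term in the topic of document j.

    delta_row[k] is the discriminating power of the term in document k and
    sigma[k][j] is the similarity between documents k and j. The document
    j itself is left out of the sum.
    """
    return sum(delta_row[k] ** 2 * sigma[k][j]
               for k in range(len(sigma)) if k != j)

# Illustrative values: the term occurs in d0 and d1 but not in d2.
delta_jvm = [1 / math.sqrt(2), 1 / math.sqrt(2), 0.0]
sigma = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
power_in_d0 = topic_discriminating_power(delta_jvm, sigma, 0)
power_in_d2 = topic_discriminating_power(delta_jvm, sigma, 2)
```

The term scores higher in the topic of d0 than in the topic of d2, because the only other document it occurs in is very similar to d0 and the documents it occurs in are only weakly similar to d2.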

3 An Algorithm for Context Enrichment through Vocabulary 
Leaps 

Attempting to find an optimal set of terms to characterize the user thematic context 
gives rise to a combinatorial problem. This is not only intractable but unreasonable 
from a pragmatic point of view. Instead, we propose to apply an intelligent IR strategy 
to explore and exploit potentially useful vocabularies. Assume the vocabulary defines a 
landscape, where the initial context is a given region of this landscape. In this scenario, 
exploitation means to thoroughly explore a given set of terms in order to find local 
optima, i.e., the best descriptors and discriminators based on a given characterization of 
the current context. Exploration, on the other hand, refers to probing new regions of the
landscape, which are dynamically discovered by performing incremental search, in the
hope of finding either better descriptors or better discriminators and therefore a better
characterization of the thematic context.

Fig. 1. A schematic illustration of the proposed mechanism for learning better context
characterizations.

Many machine learning techniques that apply the exploration-exploitation strategy
(e.g., simulated annealing and reinforcement learning) attempt to diversify (i.e., to
explore) during the initial generations and to focus (i.e., to exploit) towards the end. We
take a different approach and propose an algorithm that evolves topic
descriptors and discriminators by alternating the exploration and the exploitation of the
vocabulary landscape. We begin by exploiting the initial vocabulary by focusing on the
initial context. This vocabulary is used to iteratively form queries that are submitted to
a search engine. If after a certain number of iterations there are no significant
improvements in the search results, our algorithm performs a phase change to explore new
potentially useful regions of the vocabulary landscape. A phase change can be regarded
as a vocabulary leap, which can be thought of as a significant transformation (typically
an improvement) of the context characterization. A schematic illustration of the
proposed mechanism for learning better context characterizations is shown in Figure 1 and
is summarized in the following steps:

1. Let $C$ be the initial context description.

2. Set $C_0 = C$.

3. $i \leftarrow 0$, repeat

   (a) Start phase $\mathcal{P}_i$.

   (b) Set $\Lambda_{\mathcal{P}_i} = \emptyset$ and $\Delta_{\mathcal{P}_i} = \emptyset$.

   (c) $j \leftarrow 1$, repeat

       i. Start $\mathcal{P}_i$ evolution, $\mathcal{E}_j$.

       ii. Set $Q$ equal to some combination of context terms.²

       iii. Search with $Q$.

       iv. Make lists of topic descriptors and discriminators, $\Lambda'$ and $\Delta'$, based on the
           search results and $C_i$.

       v. Update $\Lambda_{\mathcal{P}_i}$ and $\Delta_{\mathcal{P}_i}$:

          - $\{\Lambda_{\mathcal{P}_i} | \Delta_{\mathcal{P}_i}\} \leftarrow \alpha \{\Lambda_{\mathcal{P}_i} | \Delta_{\mathcal{P}_i}\} + \beta \{\Lambda' | \Delta'\}$.

       vi. Analyze the documents' similarity³ to $C_i$ every $u$ iterations:

          - If there is a low variation ($\delta < \mu$), end $\mathcal{P}_i$. Return $\Lambda_i = \Lambda_{\mathcal{P}_i}$ and
            $\Delta_i = \Delta_{\mathcal{P}_i}$.

          - If the process has run for at least $v$ iterations and there is a very low variation
            ($\delta < \nu$), end $\mathcal{P}_i$ and go to step 4.

       vii. $j \leftarrow j + 1$.

   (d) Update $C_i$ with terms having high $\Lambda_i$ and $\Delta_i$ values to obtain $C_{i+1}$.

   (e) Let $w^k_{C_i}$ represent the weight of term $t_k$ in context $C_i$.

   (f) Set the term weights $w^k_{C_{i+1}} = \gamma w^k_{C_i} + \zeta \Lambda^k_i + \xi \Delta^k_i$.

   (g) $i \leftarrow i + 1$.

4. End process.

² See Section 4 for details on how the combination was implemented in our tests.
³ In order to test our algorithm we used the measure of novelty-driven similarity defined in
Section 4, for reasons that will become obvious in that section.
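
The two update rules in the steps above (the blending of accumulated and newly computed scores in step v, and the term-weight update in step (f)) can be sketched in Python as follows. This is only a sketch under stated assumptions: scores and weights are stored as dictionaries from terms to values, a term missing from a dictionary is treated as having value 0, and the weight update is assumed to combine the previous weight with the learned descriptive and discriminating power through three coefficients. The function names and example values are hypothetical.

```python
def blend_scores(accumulated, fresh, alpha=0.5, beta=0.5):
    """Step v: combine the scores accumulated during the phase with the
    scores computed from the latest search results, term by term.
    Assumption: a term missing from either dictionary has score 0."""
    terms = set(accumulated) | set(fresh)
    return {t: alpha * accumulated.get(t, 0.0) + beta * fresh.get(t, 0.0)
            for t in terms}

def update_context(weights, descriptors, discriminators,
                   gamma=0.33, zeta=0.33, xi=0.33):
    """Step (f): new term weights as a weighted sum of the previous weight,
    the learned descriptive power and the learned discriminating power.
    Assumption: a term missing from a dictionary contributes 0."""
    terms = set(weights) | set(descriptors) | set(discriminators)
    return {t: gamma * weights.get(t, 0.0)
               + zeta * descriptors.get(t, 0.0)
               + xi * discriminators.get(t, 0.0)
            for t in terms}

# Illustrative values only.
phase_descriptors = blend_scores({"java": 0.8}, {"java": 0.4, "jvm": 0.6})
next_context = update_context({"java": 1.0},
                              phase_descriptors,
                              {"jvm": 0.9})
```

With these illustrative values the term jvm, which is absent from the initial context, enters the next context through its descriptive and discriminating power alone.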



4 Evaluation 



The goal of this section is to provide empirical evidence supporting the hypotheses
postulated in Section 1. We show that the proposed algorithm can help enrich the topic
vocabulary and that the learned vocabulary makes it possible to generate queries that result in
better retrieval performance than queries generated directly from the initial vocabulary.

To perform our tests we used nearly 500 topics from the Open Directory Project
(ODP).⁴ The topics were selected from the third level of the ODP hierarchy. A number
of constraints were imposed on this selection with the purpose of ensuring the quality of 
our test set. The minimum size for each selected topic was 100 URLs and the language 
was restricted to English. For each topic we collected all of its URLs as well as those in 
its subtopics. The total number of collected pages was more than 350000. The Terrier 
framework ifTSl was used to index these pages and to run our experiments. 

In our tests we used the ODP description of each selected topic to create an initial
context description $C$. The proposed algorithm was run for each topic for at least
$v = 100$ iterations, with 10 queries per iteration and retrieving 10 results per query. To
create the queries $Q$ at each iteration we used the roulette selection mechanism. Roulette
selection is a technique typically used by Genetic Algorithms to choose potentially
useful solutions for recombination, where the fitness level is used to associate a
probability of selection. In our case, the fitness level was determined by the descriptive or
discriminating power values of the terms. The descriptor and discriminator lists were
limited to 100 terms each. The other parameters in our algorithm were set as follows:
$u = 10$, $\alpha = 0.5$, $\beta = 0.5$, $\gamma = 0.33$, $\zeta = 0.33$, $\xi = 0.33$, $\mu = 0.2$ and $\nu = 0.1$.
In addition, we used the stopword list provided by Terrier, applied Porter stemming to all
terms, and used none of the query expansion methods offered by Terrier.
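
Roulette selection can be sketched in a few lines of Python using the standard library. The sketch below draws terms with probability proportional to their fitness and then removes repeated draws; whether repeated terms were removed in the original experiments is not stated, so that step is an assumption, as are the function name and the example fitness values.

```python
import random

def roulette_select(fitness, k, rng=None):
    """Draw k terms with probability proportional to their fitness and
    return the distinct terms in the order they were first drawn.

    fitness maps each term to a non-negative value, at least one of which
    must be positive (random.choices rejects an all-zero weight list).
    """
    rng = rng or random.Random()
    terms = list(fitness)
    weights = [fitness[t] for t in terms]
    picks = rng.choices(terms, weights=weights, k=k)
    return list(dict.fromkeys(picks))  # drop repeats, keep draw order

# Illustrative fitness values (e.g., discriminating power of each term).
fitness = {"jvm": 0.9, "jdk": 0.7, "java": 0.2}
query = roulette_select(fitness, 3, rng=random.Random(42))
```

Terms with a higher descriptive or discriminating power are drawn more often, but low-scoring terms still have a chance of entering a query, which keeps some diversity in the generated queries.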

To analyze the evolution of the context vocabulary we propose here a revised notion
of similarity. This measure of similarity is based on $\sigma$ but disregards the terms that form
the query, favoring the exploration of new material. Given a set of queries $\{q_0, \ldots, q_p\}$
we define a novelty-driven similarity measure
$\sigma^N : \{q_0, \ldots, q_p\} \times \{d_0, \ldots, d_{m-1}\} \times \{d_0, \ldots, d_{m-1}\} \to [0, 1]$ as:

$$\sigma^N(q, d_i, d_j) = \sigma(d_i - q, d_j - q)$$

The notation $d_i - q$ stands for the representation of the document $d_i$ with all the values
corresponding to the terms from query $q$ set to zero. The same applies to $d_j - q$.
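
The following Python sketch illustrates this measure on sparse term-weight vectors. It assumes that the cosine similarity is recomputed (including its normalization) after the query terms have been zeroed out; the function name, the dictionary representation, and the example documents are hypothetical.

```python
import math

def novelty_similarity(query_terms, di, dj):
    """Cosine similarity between two documents after discarding every
    term that appears in the query. Documents map terms to weights."""
    a = {t: w for t, w in di.items() if t not in query_terms}
    b = {t: w for t, w in dj.items() if t not in query_terms}
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: nothing left to compare
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    return dot / (norm_a * norm_b)

context = {"java": 3, "virtual": 1, "machine": 1}
result_a = {"java": 5, "bytecode": 2}                # shares only the query term
result_b = {"java": 1, "virtual": 2, "machine": 2}   # shares non-query terms
score_a = novelty_similarity({"java"}, context, result_a)
score_b = novelty_similarity({"java"}, context, result_b)
```

A result that matches the context only through the query term scores zero, whereas a result that shares additional context terms is rewarded.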

We computed the novelty-driven similarity measure $\sigma^N$ between the initial context
(topic descriptions) and the retrieved results. The goal was to investigate the impact
that each phase change had on the query performance. Figure 2 shows the evolution
of the novelty-driven similarity for the topics Top/Home/Cooking/For_Children
and Top/Computers/Open_Source/Software.⁵ We used the minimum, average and
maximum novelty-driven similarity between the initial context and the search results at
each iteration to illustrate the evolution of the context vocabulary. It is worth noticing
that the vocabulary leaps that generally take effect every 10 iterations have an important
impact on the quality of the retrieved material. This provides evidence that the proposed
algorithm can help enrich the topic vocabulary.

⁴ http://dmoz.org
⁵ For figures showing the evolution of the novelty-driven similarity for each analyzed topic, visit
http://cs.uns.edu.ar/~cml/group/SimsCLEIOS.htm

Fig. 2. The evolution of minimum, average and maximum novelty-driven
similarities for the topics Top/Home/Cooking/For_Children (left) and
Top/Computers/Open_Source/Software (right).

Fig. 3. Comparison of query performance for the first iteration vs. the best iteration.



After observing that our algorithm had an impact on the retrieval performance, our
next step was to quantify this impact. To that end we computed four measures
of query quality for the queries formed using the initial vocabulary and for the queries
constructed using the evolved vocabulary. The measures used for this performance
comparison are (1) maximum novelty-driven similarity, (2) precision (fraction of retrieved
documents which are known to be relevant), (3) recall (fraction of known relevant
documents which were effectively retrieved), and (4) the harmonic mean F1 (a measure
which combines recall and precision). For a detailed description of these well-known
performance metrics we refer the reader to any IR textbook (e.g., [1]). It is worth
mentioning that the relevant set for each analyzed topic was set as the collection of its URLs as
well as those in its subtopics.
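
For reference, precision, recall and F1 can be computed from a set of retrieved URLs and a set of relevant URLs as in the following Python sketch. The function name and the example URLs are hypothetical; returning 0 when a denominator is empty is a convention adopted here for convenience.

```python
def precision_recall_f1(retrieved, relevant):
    """Precision, recall and their harmonic mean for one set of results."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Illustrative sets: 4 retrieved pages, 8 relevant pages, 2 in common.
retrieved = ["u1", "u2", "u3", "u4"]
relevant = ["u1", "u2", "u5", "u6", "u7", "u8", "u9", "u10"]
p, r, f1 = precision_recall_f1(retrieved, relevant)
```

When only a few pages are retrieved out of a large relevant set, as in this example, recall is necessarily small even for good queries.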

The charts in Figure 3 compare the performance of queries using the initial
vocabulary against queries using the evolved vocabulary. Each of the topics corresponds
to a trial and is represented by a point. The point's horizontal coordinate corresponds
to the performance of the queries at the first iteration (initial vocabulary), while the
vertical coordinate corresponds to the performance of the queries at the best iteration
(evolved vocabulary). The points above the diagonal correspond to cases in which an
improvement is observed for the evolved vocabulary. In this evaluation, queries
constructed using the evolved vocabulary outperform the initial ones in 100% of the cases
for novelty-driven similarity, 89.18% of the cases for precision, 89.38% of the cases
for recall, and 89.38% of the cases for the harmonic mean F1. It is interesting to note
that for all the topics analyzed the system managed to identify a better context
characterization, as evidenced by the 100% improvement for the novelty-driven similarity
performance metric. This highlights the usefulness of evolving the context vocabularies
to discover good query terms.

Novelty-driven similarity and precision are useful metrics when evaluating
the performance of IR systems that recover a few pages out of a large set of relevant
documents. This is the case for our particular scenario and therefore we can use these
two metrics to statistically analyze the improvements achieved by the proposed
algorithm. Table 1 presents the means and confidence intervals resulting from this
analysis. These comparison tables show that the use of an evolved vocabulary results in
statistically significant improvements over the use of the initial vocabulary.



Novelty-driven similarity   N     Mean     95% CI
first iteration             449   0.0661   [0.0618; 0.0704]
best iteration              449   0.5970   [0.5866; 0.6073]

Precision                   N     Mean     95% CI
first iteration             449   0.2662   [0.2461; 0.2863]
best iteration              449   0.3538   [0.3318; 0.3757]

Table 1. Statistical analysis comparing query performance for the initial vocabulary
(first iteration) vs. query performance for the evolved vocabulary (best iteration).
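
The paper does not state how the confidence intervals were obtained. As an illustration only, the following Python sketch computes a mean and a 95% confidence interval under a normal approximation (mean plus or minus 1.96 standard errors), which is a common choice for samples of this size. The function name and the sample values are hypothetical and are not the experimental data.

```python
import math
import statistics

def mean_confidence_interval(values, z=1.96):
    """Sample mean and normal-approximation confidence interval.
    z = 1.96 corresponds to a 95% interval; requires at least 2 values."""
    mean = statistics.mean(values)
    half_width = z * statistics.stdev(values) / math.sqrt(len(values))
    return mean, (mean - half_width, mean + half_width)

# Hypothetical per-topic precision values, for illustration only.
sample = [0.2, 0.4, 0.6, 0.8]
mean, (low, high) = mean_confidence_interval(sample)
```

Two conditions whose intervals do not overlap, as is the case for the first and best iterations in Table 1, differ by more than sampling noise would explain under this approximation.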

5 Related Work 

Extensions to basic IR approaches have examined some of the issues raised in this
paper. For instance, some automatic relevance feedback techniques, such as
Rocchio's method fTSl, make use of the full search context for query refinement. In these
approaches the original query is expanded by adding a weighted sum of terms
corresponding to relevant documents, and subtracting a weighted sum of terms from
irrelevant documents. As a consequence, the terms that occur often in documents similar to
the input topic will be assigned the highest rank, as in our descriptors. However, our
technique also gives priority to terms that occur only in relevant documents and not
just to those that occur often. In other words, we prioritize terms for both discriminating
and descriptive power. The techniques for query term selection proposed in this
paper share insights and motivations with other methods for query expansion and
refinement [19, 3]. However, systems applying these methods differ from our framework in
that they support this process through query or browsing interfaces requiring explicit
user intervention, rather than formulating queries automatically.
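
For comparison, the Rocchio-style expansion described above can be sketched in Python as follows. The query and the documents are represented as dictionaries from terms to weights. The coefficient values, the clipping of non-positive weights, and the function name are illustrative assumptions and are not taken from the systems discussed in this section.

```python
def rocchio(query, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Expand a query by adding the average relevant document and
    subtracting the average irrelevant document, term by term."""
    terms = set(query)
    for doc in relevant + irrelevant:
        terms.update(doc)
    updated = {}
    for t in terms:
        pos = sum(d.get(t, 0.0) for d in relevant) / len(relevant) if relevant else 0.0
        neg = sum(d.get(t, 0.0) for d in irrelevant) / len(irrelevant) if irrelevant else 0.0
        weight = alpha * query.get(t, 0.0) + beta * pos - gamma * neg
        if weight > 0:  # assumption: non-positive weights are dropped
            updated[t] = weight
    return updated

# Illustrative vectors only.
query = {"java": 1.0}
relevant = [{"java": 1.0, "jvm": 1.0}]
irrelevant = [{"java": 1.0, "coffee": 1.0}]
expanded = rocchio(query, relevant, irrelevant)
```

Note that this update rewards terms that are frequent in relevant documents, but it has no separate notion of a term occurring only in relevant documents, which is the role played by discriminators in our approach.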

Our techniques rely on the notion of document similarity to discover higher-order
relationships in collections of documents. This relates to the use of LSA [6] to uncover
the latent relationships between words in a collection. Less computationally expensive
techniques are based on mapping documents to a kernel space where documents that do
not share any term can still be close to each other [5]. Another corpus-based technique
that has been applied to estimate semantic similarity is PMI-IR [20], which measures
the strength of association between two elements (e.g., terms) by contrasting their
observed frequency against their expected frequency. Unlike our proposal, the
goal of these techniques is to estimate the semantic distance between terms and
documents, without identifying topic descriptors and discriminators.
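
The contrast between observed and expected frequency that underlies pointwise mutual information can be sketched in Python as follows. The sketch works on raw counts (for instance, numbers of documents matching each term and both terms together); the function name and the counts are hypothetical, and the sketch does not reproduce the specific query formulations used by PMI-IR.

```python
import math

def pmi(hits_xy, hits_x, hits_y, total):
    """Pointwise mutual information (base 2) from raw counts: the log of
    the observed co-occurrence probability over the probability expected
    if x and y were independent. All counts must be positive."""
    p_xy = hits_xy / total
    p_x = hits_x / total
    p_y = hits_y / total
    return math.log2(p_xy / (p_x * p_y))

# Hypothetical counts over a collection of 1000 documents.
associated = pmi(hits_xy=10, hits_x=20, hits_y=50, total=1000)
independent = pmi(hits_xy=1, hits_x=10, hits_y=100, total=1000)
```

A positive value indicates that the two terms co-occur more often than expected by chance, while a value near zero indicates independence.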

6 Conclusions 

In this paper we have presented an intelligent IR approach for learning context-specific 
terms. Based on this approach, an intelligent system can take advantage of the infor- 
mation available in the user context to perform search on the Web or other information 
retrieval systems. We have shown that the user context can be usefully exploited to ac- 
cess relevant material. However, terms that occur in the user context are not necessarily 
the most useful ones. In light of this we have proposed an incremental method for con- 
text refinement based on the analysis of search results. We also distinguish two natural 
notions, namely topic descriptors and topic discriminators. The proposed notions are 
useful for meaning disambiguation and therefore can help deal with the problem of 
polysemy. Our evaluations show the effectiveness of incremental methods for learning 
better vocabularies and for generating better queries. 

Learning better vocabularies is a way to increase the awareness and accessibility 
of useful material. We have proposed a promising method to identify the need behind 
the query, which is one of the main goals for many current and next generation Web 
services and tools. As part of our future work we expect to investigate different parame- 
ter settings for the proposed algorithm and to develop methods that automatically learn 
and adjust these parameters. In addition, we expect to run additional tests comparing 
our approach with other existing query refinement mechanisms. 